Merged Jupyter Notebook

DOMAIN:

Automobile

CONTEXT:

The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of three multi-valued discrete and five continuous attributes.

DATA DESCRIPTION:

The data concerns city-cycle fuel consumption in miles per gallon

Attribute Information:

  1. mpg: continuous
  2. cylinders(cyl): multi-valued discrete
  3. displacement(disp): continuous
  4. horsepower(hp): continuous
  5. weight(wt): continuous
  6. acceleration(acc): continuous
  7. model year(yr): multi-valued discrete
  8. origin: multi-valued discrete
  9. car name: string (unique for each instance)

PROJECT OBJECTIVE:

The goal is to cluster the data, treat each cluster as an individual dataset, and train regression models to predict 'mpg'. Steps and tasks: [Total Score: 25 points]

  1. Import and warehouse data: [Score: 3 points]
     - Import all the given datasets and explore shape and size.
     - Merge all datasets into one and explore the final shape and size.
     - Export the final dataset and store it on the local machine in .csv, .xlsx and .json format for future use.
     - Import the data from the above steps into Python.
  2. Data cleansing: [Score: 3 points]
     - Missing/incorrect value treatment.
     - Drop attribute(s) if required, using relevant functional knowledge.
     - Perform any other corrections/treatment on the data.
  3. Data analysis & visualisation: [Score: 4 points]
     - Perform detailed statistical analysis on the data.
     - Perform a detailed univariate, bivariate and multivariate analysis, with appropriate detailed comments after each analysis.
     - Hint: use your best analytical approach. You can even mix and match columns to create new columns for better analysis. Create your own features if required. Be highly experimental and analytical here to find hidden patterns.
  4. Machine learning: [Score: 8 points]
     - Use K-Means and hierarchical clustering to find the optimal number of clusters in the data.
     - Share your insights about the difference between using these two methods.
  5. Answer the below questions based on the outcomes of the ML-based methods. [Score: 5 points]
     - Mention how many optimal clusters are present in the data and what could be the possible reason behind it.
     - Use a linear regression model on the different clusters separately and print the coefficients of the models individually.
     - How will using different models for different clusters be helpful in this case, and how will it differ from using one single model without clustering? Mention how it impacts performance and prediction.
  6. Improvisation: [Score: 2 points]
     - Detailed suggestions or improvements on the quality, quantity, variety, velocity, veracity etc. of the data points collected by the company, to enable better data analysis in future.
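The import-and-warehouse step above can be sketched with pandas. This is a minimal illustration: the part-file names and tiny frames below are made up, and the `.xlsx` export is shown commented out since it additionally requires `openpyxl`.

```python
import pandas as pd

# Hypothetical part-datasets standing in for the files provided.
part1 = pd.DataFrame({"mpg": [18.0, 15.0], "cyl": [8, 8]})
part2 = pd.DataFrame({"mpg": [24.0], "cyl": [4]})

# Merge (stack) the datasets into one and explore the final shape.
cars = pd.concat([part1, part2], ignore_index=True)
print(cars.shape)  # (3, 2)

# Export the final dataset for future use.
cars.to_csv("cars.csv", index=False)
cars.to_json("cars.json", orient="records")
# cars.to_excel("cars.xlsx", index=False)  # needs the openpyxl package

# Re-import the stored data to confirm the round-trip.
reloaded = pd.read_csv("cars.csv")
print(reloaded.shape)  # (3, 2)
```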

American 4-cylinder cars produced in 1973 with a medium mpg level seem to dominate the dataset.

Displacement and horsepower appear to be skewed to the right.

There appears to be a linear relationship between the variables

Checking for Outliers

Except for year, most of the variables are correlated with each other.

Hierarchical Clustering

Since there appears to be too much visual clutter, we'll go ahead and cut the dendrogram to give us 2 clusters/groups.
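The dendrogram cut can be sketched with SciPy's `fcluster`. The small synthetic matrix below is a stand-in for the scaled car features; `ward` is a common linkage choice but the notebook may have used another.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated synthetic groups standing in for the scaled features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(5, 1, (20, 3))])

Z = linkage(X, method="ward")                    # same linkage used for the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")  # "cut" the tree into 2 groups
print(sorted(set(labels)))  # [1, 2]
```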

This clearly shows two distinct groups, with a difference in the averages of the variables between the clusters.

K-Means Clustering

This clearly shows two distinct groups, with a difference in the averages of the variables between the clusters.
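One common way to pick the number of K-Means clusters (as task 4 asks) is the elbow method: plot inertia against k and look for the bend. A minimal sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with two true clusters, standing in for the scaled features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(6, 1, (30, 4))])

inertias = []
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia drops sharply up to the true cluster count (2 here), then flattens:
# that bend is the "elbow".
print([round(i, 1) for i in inertias])
```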

Linear Regression on the Original Dataset

Linear Regression on Data with K Means Cluster

Linear Regression on Data with H-Clusters
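The "one regression per cluster" idea can be sketched as follows. The data here are synthetic, built so that each group follows a different linear relationship; fitting per cluster recovers both sets of coefficients, which a single global model could not.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(8, 1, (40, 2))])
# A different (noise-free) linear relationship inside each group:
y = np.where(X[:, 0] < 4, 2 * X[:, 0] + X[:, 1], -1 * X[:, 0] + 3 * X[:, 1])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Fit one model per cluster and print its coefficients individually.
models = {}
for c in np.unique(clusters):
    mask = clusters == c
    models[c] = LinearRegression().fit(X[mask], y[mask])
    print(f"cluster {c}: coef={np.round(models[c].coef_, 2)}")
```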

Summary

  1. K-Means appears to explain the highest variation in the dataset, but only by about 1% compared with the other models. To get more clarity, a larger dataset could be used. Since this is a dataset of used cars, it does not record how many previous owners each car has had, which might be a helpful variable; the gender of the previous owners and the reason/purpose for which the cars were used are also important factors that the dataset does not capture.

  2. With the above-mentioned features it may be possible to get higher accuracy or explainability from the models and their variables.

  3. Linear regression with K-Means clusters gives us 91% accuracy.

  4. Linear regression with hierarchical clusters gives us 90% accuracy.

  5. Both models perform well

Volume

Description:

Volume is the base of the Big Data 5 V's if we see them as a pyramid. The volume of data that companies manage skyrocketed around 2012, when they began collecting more than three million pieces of data every day. Since then, this volume has doubled about every 40 months.

Observation:

  1. The volume of this dataset is very small: we have only 9 features of information.
  2. The dataset has 398 rows by 9 columns of data.

Recommendation:

  1. The dataset could have had more volume, in terms of both the number of features and the number of records.
  2. Predicting mpg with such limited data might not necessarily yield reliable predictions.

Velocity:

Description:

Velocity refers to the flow of information, as quickly - as close to real time - as possible. Velocity can be more important than volume because it can give us a bigger competitive advantage. Sometimes it's better to have limited data in real time than lots of data at a low speed.

Observation:

  1. The dataset has limited information, and from its age we can conclude that it is quite old.
  2. The dataset is not real-time data: the model-year feature shows that the data belongs to the years 1970-1982.

Recommendation:

  1. To predict mpg more reliably, the data could have been sourced in real time.

Variety:

Description:

Variety is defined as the collection of data from many different sources: from in-house devices to smartphone GPS technology or what people are saying on social networks. The importance of these sources of information varies depending on the nature of the business. For example, a mass-market service or product should be more aware of social networks than an industrial business.

Observation:

The dataset has a limited variety of information:
A) mpg: continuous
B) cylinders: multi-valued discrete
C) displacement: continuous
D) horsepower: continuous
E) weight: continuous
F) acceleration: continuous
G) model year: multi-valued discrete
H) origin: multi-valued discrete
I) car name: string (unique for each instance)

Recommendation:

To improve the variety of data, some additional information like the below could have been sourced:
A) Transmission type
B) Fuel type
C) Service history

Veracity:

Description:

Veracity is equivalent to quality. We have all the data, but could we be missing something? Are the data “clean” and accurate? Do they really have something to offer?

Observation:

  1. The dataset has 9 features, of which 8 are either float64 or int64, whilst only one feature is of object data type.
  2. We observe a lot of inconsistencies in the car names, which need to be cleaned up before we perform EDA.

Recommendation:

  1. The car names should be standardized while sourcing the data, as only specific models exist in the market.
  2. Special characters in the car names should be eliminated and replaced with a space (" ").
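The recommended name clean-up can be sketched as below: lower-case, replace special characters with a space, and map known misspellings to standard names. The misspelling map here is illustrative (two examples that do occur in the mpg data), not the full list.

```python
import re
import pandas as pd

names = pd.Series(["chevrolet chevelle", "toyouta corona", "vw rabbit(custom)"])

# Illustrative corrections; a real run would build the full mapping.
fixes = {"toyouta": "toyota", "vw": "volkswagen"}

def clean(name: str) -> str:
    # Replace anything that is not a letter, digit or space with a space.
    name = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(fixes.get(w, w) for w in name.split())

print(names.map(clean).tolist())
# ['chevrolet chevelle', 'toyota corona', 'volkswagen rabbit custom']
```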

Value:

Value sits at the top of the big data pyramid. This refers to the ability to transform a tsunami of data into business value.

Description:

With the limited information provided in the dataset, it is difficult to turn the values into meaningful predictions.

Recommendation:

  1. Mpg is the target variable in this dataset, to be predicted from the other variables like cylinders, displacement, horsepower, weight, etc.
  2. We could have had more meaningful predictions if the volume of data were increased, and if the additional variety of data mentioned above (transmission type, fuel type, service history) were provided.

QED

DOMAIN:

Manufacturing

CONTEXT:

Company X curates and packages wine across various vineyards spread throughout the country.

DATA DESCRIPTION:

The data concerns the chemical composition of the wine and its respective quality.

Attribute Information:

  1. A, B, C, D: specific chemical composition measure of the wine
  2. Quality: quality of wine [ Low and High ]

PROJECT OBJECTIVE:

Goal is to build a synthetic data generation model using the existing data provided by the company. Steps and tasks: [ Total Score: 5 points]

  1. Design a synthetic data generation model which can impute values [Attribute: Quality] wherever empty the company has missed recording the data

The chemical compositions are on the same scale, between 0 and 200.

There appears to be no misclassification when checking the predicted clusters against the non-missing target values; hence the new labels can be used as the target variable.
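The synthetic-label idea can be sketched as: cluster on the chemical measures A-D, map each cluster to the majority of the known Quality labels, then fill the gaps from that mapping. The data below are synthetic and the column names follow the data description.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in: two well-separated quality groups in measures A-D.
rng = np.random.default_rng(0)
low = pd.DataFrame(rng.normal(40, 5, (30, 4)), columns=list("ABCD"))
high = pd.DataFrame(rng.normal(160, 5, (30, 4)), columns=list("ABCD"))
df = pd.concat([low, high], ignore_index=True)
df["Quality"] = ["Low"] * 30 + ["High"] * 30
df.loc[[0, 35], "Quality"] = None          # simulate missed recordings

# Cluster on the chemical measures only.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(df[list("ABCD")])

# Majority Quality label within each cluster (ignoring the gaps).
mapping = (pd.DataFrame({"c": clusters, "q": df["Quality"]})
           .dropna()
           .groupby("c")["q"]
           .agg(lambda s: s.mode()[0]))
df["Quality"] = df["Quality"].fillna(pd.Series(clusters, index=df.index).map(mapping))
print(df.loc[[0, 35], "Quality"].tolist())  # ['Low', 'High']
```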

Part 3

1. Observe dataset

There are many missing values. We will choose to impute these null values instead of dropping them.

There are a few null values in almost every column; we will impute these with the mean value. This will slightly bias the dataset but won't impact the model significantly.

We have imputed all the rows with their mean values.
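The mean imputation described above amounts to one `fillna` call; a minimal sketch on two made-up columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"compactness": [95.0, np.nan, 89.0],
                   "circularity": [48.0, 41.0, np.nan]})

# Replace every numeric NaN with that column's mean.
df = df.fillna(df.mean(numeric_only=True))
print(df.isna().sum().sum())  # 0
```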

There is a data imbalance: car has a significantly higher count compared to bus and van, while van has the lowest count.

Most of these columns share a strong correlation with each other. We can also notice certain outliers in this data.

2. EDA

Let's analyze every column in detail.

2.1 Compactness

Compactness column shows an approximately normal distribution curve while having no outliers.

2.2 Circularity

The distribution plot has 3 peaks and is right skewed.

2.3 Distance Circularity

Here the distribution plot has 2 peaks and is left skewed.

2.4 Radius Ratio

Here it is right skewed and there is presence of outliers.

Performing outlier analysis

2.5 Pr.Axis aspect ratio

The distribution is approximately normal but right skewed, and there is a presence of outliers.

2.6 Max length aspect ratio

There are 2 peaks and significant no. of outliers are observed.

2.7 Scatter ratio

There are 2 peaks observed and the distribution is right skewed.

2.8 Elongatedness

There are 2 peaks observed and the distribution is left skewed.

2.9 Pr.Axis rectangularity

There are 2 peaks observed and the distribution is right skewed.

2.10 Max length rectangularity

There are 3 peaks observed and no outliers in the data.

2.11 Scaled variance

2.12 Scaled variance.1

The distribution plot shows 2 peaks with outliers.

2.13 Scaled radius of gyration

It has an approximate normal distribution with no outliers.

2.14 Scaled radius of gyration.1

2.15 Skewness about

2.16 Skewness about.1

2.17 Skewness about.2

2.18 Hollows ratio

There are 2 peaks observed and no outliers in the data.

2.19 Class

There is some imbalance in terms of class. This will have some impact on the model.

3. EDA Summary of Independent and dependent variables

We can infer,

We have chosen not to remove the outliers, as their number is insignificant and they wouldn't bias the model. However, dropping these outliers may give us a better model, assuming they were artificially introduced during data collection.

Correlation Matrix

Here some columns share high collinearity while others share low collinearity. We can drop columns that share high collinearity, as there is no use in keeping multiple redundant columns. We will use Principal Component Analysis to determine which dimensions can be dropped.

4. Principal Component Analysis

We have 8 components in this model that explain 95% variation in this data.
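The component selection above corresponds to passing a variance fraction to scikit-learn's `PCA`: `n_components=0.95` keeps just enough components to explain 95% of the variance. A sketch on synthetic 18-feature data (built from a low-dimensional signal plus noise, so the count is small):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 18 noisy features driven by a 5-dimensional latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mix = rng.normal(size=(5, 18))
X = latent @ mix + rng.normal(scale=0.1, size=(200, 18))

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95).fit(X_std)   # keep PCs up to 95% variance
print(pca.n_components_, round(pca.explained_variance_ratio_.sum(), 3))
```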

5. Support Vector Machines

SVM model gives an accuracy of 93%

Using Grid Search

Using Grid Search the tuned model's accuracy is 83%, whereas the default SVM gives an accuracy of 93%.
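The tuning step can be sketched with `GridSearchCV`; the parameter grid and the synthetic classification data below are illustrative, not the grid used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Cross-validated search over an illustrative parameter grid.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}, cv=3)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.score(X_te, y_te), 3))
```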

6. Naive Bayes Algorithm

Naive Bayes model gives an accuracy of 76%

7. Model Comparison

8. Confusion Matrix

We can conclude that SVM is the best-suited algorithm, as it gives a good accuracy score of 93% on test data and the confusion matrix shows good classification.

To conclude, PCA helps reduce the 18-dimensional data to 8 dimensions, with an SVM accuracy of 93%.

Project 4

Domain:

Sports Management

Context:

Company X is a Sports Management Company for International Cricket.

Data Description:

The data collected belongs to batsmen from the IPL series conducted so far. Attribute Information:

  1. Runs: runs scored by the batsman
  2. Ave: average runs scored by the batsman per match
  3. SR: strike rate of the batsman
  4. Fours: number of boundaries/fours scored
  5. Six: number of boundaries/sixes scored
  6. HF: number of half-centuries scored so far

Project Objective:

Goal is to build a data driven batsman ranking model for the sports management company to make business decisions. Steps and tasks: [ Total Score: 5 points]

  1. EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods.
  2. Build a data driven model to rank all the players in the dataset using all or the most important performance features.

Hint :

EDA

Strike rate, fours, sixes and half centuries have a skewed distribution

There appear to be outliers; we will not treat them, as it is highly likely that these are genuine observations.

All the variable pairs have high correlation, except fours with strike rate, strike rate with half-centuries, and strike rate with runs.

PCA

Computing Eigenvectors and Eigenvalues:

Before computing eigenvectors and eigenvalues, we need to calculate the covariance matrix.

Covariance Matrix

Equivalently, we could have used NumPy's np.cov to calculate the covariance matrix.

Eigen Decomposition of the Covariance Matrix

Selecting Principal Components

In order to decide which eigenvector(s) can be dropped without losing too much information when constructing the lower-dimensional subspace, we need to inspect the corresponding eigenvalues: the eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data, and those are the ones that can be dropped.

Explained Variance

After sorting the eigenpairs, the next question is "how many principal components are we going to choose for our new feature subspace?" A useful measure is the so-called "explained variance," which can be calculated from the eigenvalues. The explained variance tells us how much information (variance) can be attributed to each of the principal components.
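The covariance, eigen-decomposition and explained-variance steps can be sketched end to end. Synthetic standardized data stand in for the six batting features:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize first

cov = np.cov(X.T)                              # 6x6 covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)       # eigh: for symmetric matrices
order = np.argsort(eig_vals)[::-1]             # sort by decreasing eigenvalue
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Explained-variance ratio of each principal component.
explained = eig_vals / eig_vals.sum()
print(np.round(explained, 3))
```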

The plot above clearly shows that the maximum variance (around 71%) is explained by the first principal component alone. The second, third, fourth and fifth principal components share almost equal amounts of information. The 6th component carries comparatively less information than the rest, but it cannot be ignored, since it still contributes almost 8% of the variance. We can, however, drop the last component, as it explains less than 5% of the variance.

Projection Matrix

Projection Onto the New Feature Space

In this last step we will use the 6×2-dimensional projection matrix W to transform our samples onto the new subspace via the equation Y = X × W.
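The projection equation Y = X × W is a single matrix product; a short sketch continuing the synthetic six-feature setup:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))                      # stand-in for the features

cov = np.cov(X.T)
eig_vals, eig_vecs = np.linalg.eigh(cov)
W = eig_vecs[:, np.argsort(eig_vals)[::-1][:2]]   # 6x2 projection matrix

Y = X @ W                                         # project: Y = X x W
print(Y.shape)  # (50, 2)
```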

Dot Product

PCA

PCA with 2 Components

Analyze the Results of PCA

Calculate the score for all chosen PCs and derive the ranking mechanism for top-performing players, performing a mathematical operation to arrive at a single-digit ranking score.

Display the ranking of top-performing players in descending order (top to bottom).
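One possible version of this ranking mechanism is to weight each player's PC scores by the explained-variance ratios and sort. The player names and the tiny random stats frame below are made up; the exact scoring operation in the notebook may differ.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
stats = pd.DataFrame(rng.normal(size=(8, 6)),
                     columns=["Runs", "Ave", "SR", "Fours", "Six", "HF"],
                     index=[f"player_{i}" for i in range(8)])

pca = PCA(n_components=2)
scores = pca.fit_transform(StandardScaler().fit_transform(stats))

# Single score per player: PC scores weighted by explained variance.
rank_score = scores @ pca.explained_variance_ratio_
ranking = pd.Series(rank_score, index=stats.index).sort_values(ascending=False)
print(ranking.head(3))
```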

  1. The x-axis here is 'PC 2'; the y-axis, on the other hand, is 'PC 1'.
  2. Thus, Principal Component Analysis is used to remove redundant features from the dataset without losing much information. The resulting features are low-dimensional in nature; the first component has the highest variance, followed by the second. PCA works best on datasets with 2 or more dimensions, because with higher dimensions it becomes increasingly difficult to make interpretations from the resultant cloud of data.
  3. Finally, it is important to note that our dataset contained only a few features from the start. So, when we further reduced the dimensionality using PCA, we found we only need two components to separate the data. That is why even a two-dimensional plot is enough to see the separation. This might not always be the case: we may have more features, and correspondingly more components, in real-life problems or other datasets. Then we might need a different way to represent the results of PCA.

Conclusion

  1. The goal was to build a data-driven batsman ranking model for the sports management company to make business decisions. Business owners can now take decisions based on the grading system created for the best-performing batsmen.

  2. We used the sports dataset provided to implement PCA: we obtained eigenvalues and eigenvectors, and performed a mathematical operation between the eigenvalues, eigenvectors and the original dataset to arrive at a single-digit score. We noticed that we didn't lose much information after dimensionality reduction.

  3. We see there is no significant difference in the PCA-based predictions for the best-performing player/batsman.

  4. The business can take decisions based on the PCA ranking to choose the top-performing batsmen/players for their respective teams.

Project 5

Questions: [ Total Score: 5 points]

  1. List down all possible dimensionality reduction techniques that can be implemented using python.
  2. So far you have used dimensional reduction on numeric data. Is it possible to do the same on a multimedia data [images and video] and text data ? Please illustrate your findings using a simple implementation on python.

There are many dimensionality reduction techniques available; please list them. For dimensionality reduction on multimedia data, please find below an approach for images: import images into Python using PIL or any other Python image library.

  1. Display the image - actual.
  2. Display the image - matrix (hint: an image is an M×N matrix of numbers).
  3. SL algorithms require an M×N dataframe to classify, whereas here a single image is itself M×N, making it difficult for an SL algorithm to take in the data.
  4. Hint solution: flatten each image, i.e. M×N → 1×(M·N).
  5. Apply an SL algorithm like KNN or SVM or any other algorithm of your choice. Note the accuracy (A1).
  6. Use the image matrix to perform PCA on it.
  7. Apply the same SL algorithm as used above on the dimensionally reduced data. Note the accuracy (A2).
  8. Compare A1 ~ A2. If they are similar, then dimensionality reduction has worked on the image.

Reference link: https://www.kaggle.com/hamishdickson/preprocessing-images-with-dimensionality-reduction
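The hinted pipeline can be sketched on scikit-learn's small built-in digits set (images already flattened to 64 features), using KNN as the SL algorithm:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()                      # data is already flat: (1797, 64)
X_tr, X_te, y_tr, y_te = train_test_split(digits.data, digits.target,
                                          random_state=0)

# A1: SL algorithm on the raw flattened pixels.
a1 = KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)

# A2: same algorithm after PCA (64 -> 20 dimensions).
pca = PCA(n_components=20).fit(X_tr)
a2 = (KNeighborsClassifier()
      .fit(pca.transform(X_tr), y_tr)
      .score(pca.transform(X_te), y_te))

print(round(a1, 3), round(a2, 3))           # similar -> reduction worked
```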

Solution Part I

List of All Possible Dimensionality Reduction Techniques and When to Use Them

Dimensionality reduction techniques can be classified into three types:

Feature Selection:

a) Missing Value Ratio: If the dataset has too many missing values, we use this approach to reduce the number of variables. We can drop the variables having a large number of missing values in them.
b) Low Variance filter: We apply this approach to identify and drop constant variables from the dataset. The target variable is not unduly affected by variables with low variance, and hence these variables can be safely dropped.
c) High Correlation filter: A pair of variables having high correlation increases multicollinearity in the dataset. So, we can use this technique to find highly correlated features and drop them accordingly.
d) Random Forest: This is one of the most commonly used techniques which tells us the importance of each feature present in the dataset. We can find the importance of each feature and keep the top most features, resulting in dimensionality reduction.
e) Backward Feature Elimination and Forward Feature Selection: both techniques take a lot of computational time and are thus generally used on smaller datasets.

Components / Factor Based:

a) Factor Analysis: This technique is best suited for situations where we have highly correlated set of variables. It divides the variables based on their correlation into different groups, and represents each group with a factor.
b) Principal Component Analysis: This is one of the most widely used techniques for dealing with linear data. It divides the data into a set of components which try to explain as much variance as possible.
c) Independent Component Analysis: We can use ICA to transform the data into independent components which describe the data using fewer components.

Projection Based:

a) ISOMAP: We use this technique when the data is strongly non-linear.
b) UMAP: This technique works well for high-dimensional data, and its run time is shorter.

Solution Part II

Introduction

There already exists a plethora of notebooks discussing the merits of dimensionality reduction methods, in particular PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis). Quite a handful of these have compared one to the other, but few have gathered both in one go. Therefore this notebook will aim to provide an introductory exposition of these methods, as well as to portray their visualisations interactively, and hopefully more intuitively, via the Plotly visualisation library. The methods are structured and implemented as follows:

Principal Component Analysis ( PCA ) - Unsupervised, Linear Method

Linear Discriminant Analysis (LDA) - Supervised, Linear Method

Curse of Dimensionality & Dimensionality Reduction

The term "Curse of Dimensionality" has oft been thrown about, especially when PCA and LDA are thrown into the mix. The phrase refers to how our perfectly good and reliable machine learning methods may suddenly perform badly when we are dealing with a very high-dimensional space. But what exactly do these acronyms do? They are essentially transformation methods used for dimensionality reduction. Therefore, if we are able to project our data from a higher-dimensional space to a lower one while keeping most of the relevant information, that makes life a lot easier for our learning methods.

MNIST Dataset

For the purposes of this interactive guide, the MNIST (Mixed National Institute of Standards and Technology) computer vision digit dataset was chosen partly due to its simplicity and also surprisingly deep and informative research that can be done with the dataset.

So let's load the training data and see what we have

Loading the libraries and MNIST Dataset ( Train & Test Data )

The MNIST set consists of 42,000 rows and 785 columns. Each row holds a 28 × 28 pixel image of a digit (contributing 784 columns), plus one extra label column stating which digit (0-9) the image represents. Each pixel value describes the intensity of that pixel and, after scaling, lies between zero and one.

Train Data Set Info

Let's check if there are any null values.

As it is a classification problem, let's check whether we have balanced or imbalanced classes.

Let's Scale Down Our Data

By dividing it by the maximum pixel value, i.e. 255.

Let's view one of the images from the dataset.

Train and Test Dataset

Train models with SVM, Random Forest, KNN and XGBoost classifiers and check the accuracy.

For KNN, let us check different numbers of neighbours, ranging from 1 to 20, and select the neighbour count with maximum accuracy.
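The neighbour search can be sketched as a simple loop; the small built-in digits set stands in for MNIST here to keep the run fast.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Accuracy for each neighbour count from 1 to 20.
accs = {k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
        for k in range(1, 21)}
best_k = max(accs, key=accs.get)
print(best_k, round(accs[best_k], 3))
```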

We can see that we get maximum accuracy for neighbours 1 and 3.

Out of the 4 algorithms, both SVM and XGBoost have performed better compared to KNN and Random Forest.

Now let's try reducing our features using PCA and see whether our accuracy improves.

PCA

Principal Component Analysis (PCA)

In a nutshell, PCA is a linear transformation algorithm that seeks to project the original features of our data onto a smaller set of features ( or subspace ) while still retaining most of the information. To do this the algorithm tries to find the most appropriate directions/angles ( which are the principal components ) that maximise the variance in the new subspace. Why maximise the variance though?

To answer the question, more context has to be given about the PCA method. One has to understand that the principal components are orthogonal to each other ( think right angle ). As such when generating the covariance matrix ( measure of how related 2 variables are to each other ) in our new subspace, the off-diagonal values of the covariance matrix will be zero and only the diagonals ( or eigenvalues) will be non-zero. It is these diagonal values that represent the variances of the principal components that we are talking about or information about the variability of our features.

Therefore when PCA seeks to maximise this variance, the method is trying to find directions ( principal components ) that contain the largest spread/subset of data points or information ( variance ) relative to all the data points present.

Calculating the Eigenvectors

Now it may be informative to observe what the variances look like for the digits in the MNIST dataset. To achieve this, let us calculate the eigenvectors and eigenvalues of the covariance matrix as follows.

Standardize the Data

We see that nearly 90% of the variance can be explained by about 200 components.

Plotting Eigen Values

As alluded to above, the PCA method seeks to obtain the optimal directions (or eigenvectors) that capture the most variance (spread out the data points the most). Therefore it may be informative (and cool) to visualise these directions and their associated eigenvalues. For the purposes of this notebook, and for speed, we will invoke PCA to extract only the top 30 components (using Sklearn's .components_ attribute) from the digit dataset and compare them visually.

Takeaway from the Plots

The subplots above portray the top 30 optimal directions, or principal component axes, that the PCA method has generated for our digit dataset. Of interest: when one compares the first component ("Eigenvalue 1") to the 28th component ("Eigenvalue 28"), it is obvious that more complicated directions or components are being generated in the search to maximise variance in the new feature subspace.

Plotting MNIST Data

Let's plot the actual MNIST digit set to see what the underlying dataset actually represents, rather than being caught up with just looking at 1 and 0's.

PCA Implementation via Sklearn

Now using the Sklearn toolkit, we implement the Principal Component Analysis algorithm as follows:

What the chunk of code above does is first normalise the data (not strictly necessary for this dataset, as the values are already scaled between 0 and 1) using Sklearn's convenient StandardScaler call.

Next we invoke Sklearn's inbuilt PCA function, providing in its n_components argument the number of components/dimensions we would like to project the data onto. In practice, one motivates the choice of components by, for example, looking at the proportion of variance captured versus each component's eigenvalue, as in our explained-variance plots. For the purposes of this example, however, we take a PCA with 5 components (rather than 200-odd).

Finally we call both the fit and transform methods, which fit the PCA model to the standardised digit data and then apply the dimensionality reduction to the data.
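The scale → PCA(n_components=5) → fit/transform steps read as follows in one sketch (the small digits set stands in for the MNIST frame):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # normalise first

pca = PCA(n_components=5)                   # project onto 5 dimensions
X_5d = pca.fit_transform(X_std)             # fit + transform in one call
print(X_5d.shape, round(pca.explained_variance_ratio_.sum(), 3))
```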

Visualisation of PCA Representations

Interactive visualisations of PCA representation When it comes to these dimensionality reduction methods, scatter plots are most commonly implemented because they allow for great and convenient visualisations of clustering ( if any existed ) and this will be exactly what we will be doing as we plot the first 2 principal components as follows:

Takeaway from the Plot

As observed from the scatter plot, you can just about make out a few discernible clusters evinced from the collective blotches of colors. These clusters represent the underlying digit that each data point should contribute to and one may therefore be tempted to think that it was quite easy to implement and visualising PCA in this section.

However, the devil lies in the tiny details of the python implementation because as alluded to earlier, PCA is actually in fact an unsupervised method which does not depend on class labels.

As we can see, most of the variance is explained by 150-300 PCs, so we will take the PCs that explain about 90% of the variance.

Let's First Plot the First 2 PCs

We can see that we again get maximum accuracy for neighbours 1 and 3.

As we can see, using PCA the accuracy of SVM and KNN has improved a little, there is no change in the accuracy of XGBoost, and in the case of Random Forest the accuracy has dropped.

K - Means

Imagine just for a moment that we were not provided with the class labels to this digit set because after all PCA is an unsupervised method. Therefore how would we be able to separate out our data points in the new feature space? We can apply a clustering algorithm on our new PCA projection data and hopefully arrive at distinct clusters which would tell us something about the underlying class separation in the data.

To start off, we set up a KMeans clustering method with Sklearn's KMeans call and use the fit_predict method to compute cluster centers and predict cluster indices for the first and second PCA projections (to see if we can observe any appreciable clusters).
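The KMeans step described above can be sketched as below, again with the small digits set standing in for MNIST:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)   # first and second projections

# fit_predict computes cluster centers and returns a cluster index per sample.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_idx = kmeans.fit_predict(X_2d)
print(kmeans.cluster_centers_.shape, len(set(cluster_idx)))
```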

Takeaway from the Plot

Visually, the clusters generated by the KMeans algorithm appear to provide a clearer demarcation amongst clusters as compared to naively adding in class labels into our PCA projections. This should come as no surprise as PCA is meant to be an unsupervised method and therefore not optimised for separating different class labels. This particular task however is accomplished by the very next method that we will talk about.

Using LDA

Linear Discriminant Analysis (LDA)

  1. LDA, much like PCA, is also a linear transformation method commonly used in dimensionality reduction tasks. However, unlike the latter, which is an unsupervised learning algorithm, LDA falls into the class of supervised learning methods. As such, with information about class labels available, LDA seeks to maximise the separation between the different classes by computing the component axes (linear discriminants) which achieve this.
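A minimal LDA sketch on the small digits set shows the supervised contrast with PCA: `fit_transform` takes the labels, and with 10 classes LDA can produce at most 9 discriminant axes (we keep 2 for plotting).

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)   # uses the class labels, unlike PCA
print(X_lda.shape)  # (1797, 2)
```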

Loading the Results into Submission File

Printing the Confusion Matrix

Concluding Remarks

Thus, Principal Component Analysis is used to remove redundant features from datasets without losing much information. The resulting features are low-dimensional in nature; the first component has the highest variance, followed by the second. PCA works best on datasets with 2 or more dimensions, because with higher dimensions it becomes increasingly difficult to make interpretations from the resultant cloud of data.

In conclusion, this notebook has introduced and briefly covered the dimensionality reduction methods commonly used by ML practitioners - PCA and LDA. We touched on the concepts of finding principal components and linear discriminants, discussed the relative merits of supervised and unsupervised methods, and applied the KMeans clustering technique in the unsupervised scenario.

Apart from these common reduction methods, there exists a whole host of other dimensionality reduction methods not discussed in this notebook. Just to name a few, they include methods like Sammon's Mapping, Multi-Dimensional Scaling, or even some graph-based visualisation methods.

What else could have been done in this model:

  1. Hyperparameter tuning: as the dataset is quite large, the computation time for hyperparameter tuning would be very high.
  2. An ROC-AUC curve could have been used.